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1  INTRODUCTION 


Beamfomiing  is  a  classical  technique  used  in  array  signal  processing  to  determine  the 
location  of  a  source  that  is  radiating  energy.  A  large  number  of  sensors  needs  to  be  used 
to  obtain  sufficient  gain  and  thus  accurately  detect  a  distant  target.  The  computational 
complexity  of  a  beamformer  increases  as  the  number  of  sensors  used  increases.  Parallel 
processing  is  the  obvious  choice  when  a  large  number  of  sensors  is  used.  In  section  2,  we 
give  a  brief  description  of  beamforming  and  the  Minimum  Variance  Distortionless  Re¬ 
sponse  (MVDR)  beamformer.  In  section  3,  we  discuss  the  iWarp  architecture  and  the 
node  connection  topologies  we  used.  Sections  4,  5,  and  6  discuss  implementations  of 
Cholesky,  QR,  and  MVDR  algorithms,  respectively.  Section  7  presents  performance 
measurements  from  those  implementations.  The  last  section  gives  a  brief  summary  and 
discusses  some  future  investigations.  The  complete  MVDR  code  for  running  on  a  fully 
connected  network  of  four  processors  is  also  included  as  appendix  A. 
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2  BEAMFORMING  AND  THE  MVDR  BEAMFORMER 


In  this  section,  we  give  a  brief  description  of  beamforming  and  a  derivation  of  the 
MVDR  beamformer.  For  a  detailed  discussion  on  the  subject,  see  references  1  and  2.  We 
assume  a  plane-wave  signal  s(t,  x)  is  propagating  in  a  medium  with  the  form  at  a  spatial 
location  z  and  time  t: 

*(')  =  {'  * 

where  -4o  is  the  direction  of  the  propagating  wi  and  c  is  its  speed.  We  also  assume 
there  is  an  array  of  n  sensors  present  in  the  medium.  Each  sensor  will  record  the  acoustic 
field  at  its  spatial  position  with  little  interference  with  each  other.  Thus  the  waveform 
measured  at  the  spatial  position  z.  of  the  ith  sensor,  denoted  by  is  given  by 

J,(()  .  /t  + 

where  N.  (f)  is  additive  noise  measured  at  z,. . 

In  using  beamforming  to  determine  the  direction  of  an  acoustic  source,  the  outputs 
of  the  sensors  are  summed  with  weights  and  delays  to  form  a  beam  y(t): 

y(t)  = 

The  basic  idea  of  beami^.  ning  is  to  adjust  the  sensor  delays  so  that  a  signal  presumed  to 
be  propagating  in  direction  -Iq  will  be  reinforced,  and  signals  propagating  from  other 
directions  will  not.  The  ideal  case  is  that  each  sensor  delay  r.  cancels  the  signal  delay 
z,  •  ^q/c  so  the  signal  would  be  completely  reinforced.  To  search  for  the  source  signal,  the 
energy  in  the  beam  y{t)  is  computed  from  many  directions-of-look  k  by  manipulating  the 
delays  The  maxima  of  this  energy  as  a  function  of  k  correspond  to  the  acoustic 
sources. 

To  compute  the  beam  energy  in  the  frequency  domain,  we  take  the  Fourier  transform 
of  the  beam.  For  a  single  propagating  signal,  the  Fourier  transform  of  the  beam,  denoted 
Y(f,  k),  is  given  by 

Y(f,k)  = 

I 

where  S(f)  is  the  Fourier  transform  of  the  signal  5(0- 

The  energy  in  the  beam  can  be  computed  by  evaluating  J\Y(f,  k)\2df.  For  a  narrowband 
signal  with  all  its  energy  concentrated  at  the  frequency  /„ ,  the  beam  energy  P(k)  is  given 
by 

where  is  a  normalizing  factor  of  the  signal. 
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It  is  now  convenient  to  write  equation  1  in  matrix  form.  Define  the  column  vector  X  to 
be  the  temporal  Fourier  transforms  of  the  array  outputs  (the  signal  data  vector)  and  the 
“steering  vector”  A  to  have  elements 

.  \  .2nf 

A,.  =  QiexpXj — -z.  •  k 

Then  Y{f,  k)  =  A'X,  where  '  denotes  the  conjugate  transpose.  The  energy  in  the  beam 
when  steered  in  direction  k  is  given  by 

P{k)  =  EIITC/; /:)!"]  =  E[\A'Xt]  =  A'R^A,  (1) 

where  =  £[XX']  is  the  spatial  correlation  matrix  of  the  sensor  outputs.  For  a  general 
narrowband  signal,  its  spatial  correlation  matrix  R^  in  the  presence  of  noise  can  be  written 
as  (see  reference  1) 

R^  =  E\{X  +  N){X  +  m  , 

where  X  and  N  are  vectors  related  to  array  outputs  of  signal  and  noise,  respectively. 
Clearly,  the  quantity  E{XX')  is  hard  to  compute.  In  practice,  R^  is  often  estimated  by  XX’, 
where  X  is  an  m  x  n  matrix  with  m  <  n.  The  row  dimension  m  is  the  number  of  sensors 
and  the  column  dimension  n  is  the  number  of  “snapshots”  in  time. 

MVDR  beamforming  is  a  high-resolution  array  signal  processing  algorithm.  It  is 
derived  by  finding  the  steering  vector  A  that  yields  the  minimum  beam  energy  subject  to 
the  constraint  that  A’E  =  1,  where  E  represents  an  ideal  plane  wave  corresponding  to  the 
direction-of-look.  The  purpose  of  the  constraint  is  to  fix  the  processing  gain  in  the  direc- 
tion-of-look  to  be  unity.  The  minimization  of  the  resulting  energy  thus  reduces  the  contri¬ 
butions  to  this  energy  from  sources  and/or  noises  not  propagating  in  the  direction-of-look. 
This  constraint  minimization  problem  can  be  converted  to  a  global  minimization  problem 
by  using  a  Lagrange  multiplier.  Thus  we  need  to  find  a  vector  A  that  minimizes  the 
quantity 


A'R^  A  +  a(A'E  -  1)  . 


It  is  not  hard  to  find  the  solution  A  to  be 


E’RZ^E 

C 


Therefore,  the  energy  in  the  beam  when  steered  in  the  direction-of-look  determined  by  E 
becomes  (see  equation  1) 


MVDR 


1 

[E'R-^E)  ■ 


(2) 


Equation  2  provides  the  quantity  that  we  need  to  compute  on  a  distributed-memory 
machine.  To  find  a  radiating  source,  we  compute  the  spatial  energy  spectrum  by  evaluat¬ 
ing  equation  2  for  many  direction-of-look  Es  and  determine  the  location  of  local  maxima. 
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In  general,  the  correlation  matrix  is  complex.  For  the  simplicity  of  data  structures,  but 
without  loss  of  generality  for  our  purpose,  we  used  real  matrices  in  our  parallel  algorithm 
implementations.  Clearly,  we  also  need  to  assume  that  the  matrix  is  invertible.  In  our 
implementation,  we  assume  X  is  full-rank.  One  approach  to  compute  the  right-hand  side 
of  equation  2  is  to  do  a  Cholesky  factorization  on  the  matrix  R^ .  Another  approach  is  not 
to  form  the  matrix  R^  =  XX'  and  compute  Instead,  a  QR  factorization  on  X  is  per¬ 
formed.  The  second  approach  has  some  advantage  in  terms  of  numerical  stability.  We 
implemented  both  parallel  Cholesky  and  parallel  QR  algorithms  and  compared  their 
performances.  For  the  MVDR  beamforming  computation,  we  used  the  QR  approach. 
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3  THE  iWARP  ARCHITECTURE  AND  INTERPROCESSOR 

COMMUNICATIONS 


The  iWarp  system  is  a  distributed-memory,  scalable,  multiple-instruction,  multiple- 
data  (MIMD)  parallel  computer.  The  system  is  equipped  with  an  interesting  communica¬ 
tions  library  that  supports  systolic  communications  between  its  processors.  Since  the  sys¬ 
tolic  communication  mechanism  on  iWarp  offers  “door-to-door”  data  transfer  between 
CPUs  on  different  processors  instead  of  going  through  local  memories,  the  iWarp  system 
should  be  ideal  for  fine-grain  (communications-intensive)  applications,  for  example  the 
digital  signal  and  image  processing  applications.  The  system  we  used  has  64  processors 
(or  nodes)  that  are  physically  connected  into  a  two-dimensional  toroidal  mesh.  The  front- 
end  machine  is  a  Sparc  workstation  that  contains  a  Sun  Interface  Board  (SIB).  The  SIB 
can  be  used  either  as  a  special  I/O  node  or  just  as  other  nodes  on  the  system. 

Each  iWarp  node  consists  of  three  distinct  components:  a  computation  agent,  a  com¬ 
munication  agent,  and  a  local  memory  unit.  The  computation  agent  performs  the  normal 
programmed  computation.  The  communication  agent  handles  the  I/O  from/to  adjacent 
nodes.  The  local  memory  unit  provides  access  to  local  memory  for  both  the  computation 
and  communication  agents.  Each  node  on  the  system  has  a  0.5-megabyte  base  memory, 
which  can  be  upgraded  in  0.5-megabyte  increments.  Independent  communication  and 
computation  agents  make  it  possible  to  overlap  communication  and  computation.  Nonad- 
jacent  nodes  in  the  array  can  communicate  without  necessarily  disturbing  the  computation 
on  intermediate  nodes. 

Each  iWarp  processor  has  an  upper-limit  speed  (or  “peak  performance”)  of  20 
MFLOPS  on  single  precision  floating-point  operations  and  10  MFLOPS  on  double  preci¬ 
sion  operations.  The  bandwidth  of  interprocessor  I/O  is  40  megabytes  per  second.  So 
roughly  speaking,  a  minimum  of  four  single  precision  operations  are  needed  on  each 
word  (4  bytes)  of  data  being  sent  to  gain  through  parallelism. 

Interprocessor  communications  are  realized  using  the  message-passing  mechanism 
provided  by  the  PathLib  library  on  the  system  (references  3  and  4).  To  do  message  pass¬ 
ing  in  a  C  program,  one  first  needs  to  create  connections.  The  message  passing  along  the 
created  connections  is  done  by  calling  a  send  or  receive  function  in  the  PathLib.  Unlike 
some  other  parallel  systems  (e.g.,  the  Intel  Touchstone),  the  task  of  routing  a  certain 
user-specified  connection  topology  on  the  iWarp  is  left  entirely  to  the  programmer.  In 
other  words,  the  programmer  is  responsible  for  explicitly  specifying  the  routes  to  be  es¬ 
tablished  for  the  connection  topology;  the  programmer  must  also  appropriately  allocate 
resources  required  for  the  type  of  connection  he  chooses.  This  lack  of  “automatic  routing 
by  the  system”  clearly  puts  some  lower  level  programming  burden  on  the  programmer. 
The  gain  from  this  kind  of  system  design  is  the  flexibility  and  efficiency  to  be  explored  by 
a  programmer  for  his  particular  applications.  Though  there  are  some  programming  tools 
available  for  some  simple  image  processing  applications,  it  is  difficult,  at  least  at  this 
time,  to  get  decent  performance  on  the  iWarp  for  most  signal  and  image  processing 
algorithms  by  using  these  tools.  We  did  all  our  implementations  in  C  using  the  PathLib. 
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To  establish  a  communication  channel  between  a  pair  of  iWarp  nodes,  called  the 
logical  channel  in  the  iWarp  system  manual,  a  few  things  must  be  done:  initializing  the 
PathLib,  allocating  Pathway  Control  Table  (PCT),  requesting/accepting  connections,  and 
identifying  incoming  connections.  Some  care  is  needed  to  avoid  PCT  allocation  conflict 
when  one  wants  to  set  up  a  richly  connected  topology. 

To  send  or  receive  a  message  over  a  logical  channel,  the  programmer  needs  to  bind  an 
input  gate  or  output  gate  to  a  proper  handler,  which  is  returned  by  the  request  or  accept 
connection  function  calls.  A  message  can  be  delimited  by  a  header  and  a  trailer.  Though 
it  may  not  be  necessary  to  use  a  header  and  a  trailer  for  a  simple  communication  style 
(e.g.,  a  unidirectional  ring  connection),  they  are  useful  for  more  complex  communication 
requirements. 

In  our  implementations,  we  used  a  unidirectional  ring  topology  and  a  fully  connected 
network  topology.  In  the  unidirectional  ring  topology,  each  processor  has  a  unidirectional 
communication  channel  with  its  two  neighbors.  Each  processor  can  send  messages  to  its 
downstream  neighbor  and  receive  messages  from  its  upstream  neighbor  once  the  direction 
of  the  ring  is  determinea.  This  type  of  sparsely  connected  topology  is  suitable  for  applica¬ 
tions  that  require  pipelining  communications.  The  communication  overhead  will  be  arge 
if  a  processor  wants  to  send  messages  to  a  distant  processor  on  the  ring.  In  the  fully 
connected  topology,  we  set  up  a  bidirectional  connection  between  each  pair  of  processors. 
Thus  only  nearest  neighbor  communications  are  required  with  such  a  topology.  There  are 
two  reasons  for  choosing  these  two  extreme  cases  of  connection  topology:  (1)  we  want  to 
test  the  feasibility  of  setting  up  various  connection  topologies;  the  success  of  making  these 
two  connections  should  give  us  the  experience  for  making  a  variety  of  other  interesting 
connection  topologies,  and  (2)  we  are  also  interested  to  see  the  performance  difference  in 
our  applications  using  these  two  connection  topologies. 
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4  A  PARALLEL  CHOLESKY  ALGORITHM  IMPLEMENTATION 


As  mentioned  in  section  2,  the  Cholesky  factorization  can  be  used  to  compute  the 
right-hand  side  of  equation  2.  Since  the  correlation  matrix  is  symmetric  (we  only 
consider  the  real  matrix),  it  has  the  Cholesky  factorization  =  GG',  where  G  is  a  lower 
triangular  real  matrix. 

To  design  a  parallel  algorithm  for  computing  the  Cholesky  factorization  on  an  unidi¬ 
rectional  ring  topology,  we  first  need  to  make  a  decision  on  how  to  partition  and  distribute 
the  matrix.  We  choose  to  partition  the  matrix  by  columns.  Let  us  assume  we  want  to  use  p 
processors  on  the  system.  A  partition  of  an  m  x  n  matrix  (for  the  simplicity  of  notation, 
we  assume  n  can  be  divided  by  p)  could  be 

“  [^l>  ^2’  ’  ’ 

where  s  are  m  x  p  submatrices  and  kp  =  n.  We  use  the  letter  k  in  all  our  algorithms  to 
be  the  number  of  column  partitions.  We  distribute  the  jth  columns  (/  =  1  ■  •  -p)  of  the 
submatrices  i?,.  (/  =  1,  •••k)  onto  the  processor  j.  Thus  the  processor  j  will  hold  columns 
with  indices  j,  j  +  1,  •••,  j  +  (k  -  l)p.  A  more  concrete  example  should  explain  this 
approach  better.  Suppose  we  have  a  matrix  written  in  column  notation 

A  =  [c2^,(Z2,  ’  ’  ’ ,  flg]  1 

where  ajs  are  columns  of  A.  After  partitioning  and  distributing  the  matrix  A  onto  four 
processors  using  our  approach,  processor  1  contains  Oj  and  Oj,  processor  2  contains  Oj 
and  flg,  processor  3  contains  and  a^,  and  processor  4  contains  and  a^. 

This  approach  of  partition  and  distribution  of  matrix  columns  is  clearly  suitable  for  a 
column-oriented  parallel  Cholesky  algorithm,  where  m  =  n. 

Comparing  the  ;th  columns  of  both  sides  of  the  equation  R^  =  GG',  we  get  the  vector 
equation 

i  =  1 

where  :  is  a  range  symbol  that,  when  no  range  is  explicitly  specified,  implies  the  entire 
row  or  column  dimension.  We  obtain  from  the  above  equation 

y  -  1 

G(j  :  n,J)G(j,J)  =  R^(j  :  n,j)  -  ^  G(j  :  n,i)G(j,i)  =  a(j  :  n)  . 

i  =  1 

Since  G(j,  j)  =  a{j),  it  follows  that  G(j  :  n,  j)  =  a(j  :  n)/Joi (J) .  So  a  column  version  of  the 
(nonparallel)  Cholesky  algorithm  is  as  follows: 

Column  Choleksy  Algorithm 

for  J  =  1 :  n 

Q(i  :  n)  ^  Rc{j  :n,j) 
for  i  =  1  :  j  -  1 
for  k  —  I  :  j  -1 

a{k)  =  a{k)-G{k,i)G{j,k) 
end 
end 

G{j  :  n.  j)  ^  a(i  :  n)/v/a(7) 

end 
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We  now  describe  a  parallel  Cholesky  algorithm  based  on  the  above  column  algorithm, 
which  can  be  found  in  reference  5.  Note  that  in  the  column  algorithm,  we  need  to  com¬ 
pute 

a(j  :  n)  =  R^(j  :  n,f)  -  ^  G(j  :  n,  i)G(j,  f)  . 

i  =  1 

followed  by  the  scaling  G(j  :  n,  j)*~Oi(j  :  n)/jQi(l)  ■  Suppose  each  of  the  p  processors  holds 
k  column  vectors  of  as  described  at  the  beginning  of  this  section.  The  task  of  each 
processor  is  to  overwrite  these  k  columns  with  the  nonzero  portion  of  the  corresponding 
columns  of  the  Cholesky  factor  G.  Notice  that  a  column  GQ  :  n,  j)  i,jnerated  by  one 
processor  is  generally  needed  by  all  the  other  processors.  So  a  column  G(j  :  n,  j)  gener¬ 
ated  by  a  processor  Proc{i)  will  be  circulated  around  the  ring.  At  each  stop,  the  visiting 
G(j  :  n,  J)  is  incorporated  into  all  the  local  Qc(i)  for  i  >  j. 

By  counting  the  number  of  received  GO'  :  n,  j)  columns,  a  processor  can  determine 
whether  it  is  its  turn  to  generate  a  G(j  :  n,  j)  column.  For  example,  if  Proc{i)  has  received 
/  -  1  columns  and  i  e  {/,  i  +  p,  •••,/+  (^  -  l)p},  then  it  knows  that  it  is  its  time  to 
generate  G(i  :  n,  i).  Here  is  the  ring  parallel  algorithm  for  the  processor  i: 

Ring  Parallel  Cholesky  Algorithm 

«  ^  0,  i  4—  0,  ji  1; 

last  *-  i-\  [k-  l)jj; 

while  j  ^  last 
ifj-f  1  e 

J  4-  3  +  1; 

Generate  G(j  :  n,j)  in  gioeU  '•  «)  and  copy  it  into  Rioe(j  ■ 

it  s  <  k 

scnd{gioc{j  :  n), right)-, 
update  Rioc{-.,3i  +  1  :  *); 

3  3  +  1; 

end 

Ji  =  Ji  +  1; 
else 

receive{gioc{s  +  1  :  k,left)-, 

if  3  +  1  G  {right,  right  +  p,  •  •  • ,  right  +  {k  -  l)p} 

»end{gioc{s  +  1  :  n),  right)] 
end 

3  ♦-  3  +  1; 
update  Rioc{:,ji  :  *); 
end 

end 

A  few  remarks  on  the  variables  appearing  in  the  above  algorithm  are  in  order.  At  the 
beginning  of  each  while  loop,  the  variable  s  indicates  the  largest  global  column  index 
among  those  columns  that  have  been  overwritten  by  their  corresponding  G  columns.  The 
variable  j  is  used  to  simplify  the  notation  for  5  +  1.  The  variable  points  to  the  next 
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column  of  that  will  produce  a  G(j  :  n,  j)  column.  Also  note  that  a  processor  does  not 
send  the  circulating  9ioe  to  its  right  neighbor  if  its  right  neighbor  is  the  generator  of  gioe  . 

We  implemented  the  above  ring  parallel  algorithm  on  the  iWarp  system  with  a  maxi¬ 
mum  of  16  nodes.  It  is  possible  to  write  a  program  that  is  scalable  on  the  number  of 
processors.  The  amount  of  physical  memory  on  each  node  does  put  a  constraint  on  the 
size  of  the  problem  we  can  test,  and  thus  the  number  of  processors  to  use.  The  current 
software  on  the  system  supports  the  message  passing  in  the  unit  of  a  word  in  a  C  program 
(see  reference  6).  We  therefore  wrote  a  few  functions  that  enable  us  to  send  and  receive  a 
message  with  a  length  of  arbitrary  number  of  bytes.  After  a  ring  connection  of  nodes  are 
set  up  using  logical  channels,  we  assign  each  node  a  ring  ED.  The  above  parallel  algorithm 
can  thus  be  imp  lemented  on  the  system  without  any  modification. 


5  A  PARALLEL  QR  ALGORITHM  AND  ITS  IMPLEMENTATION 

QR  factorization  is  another  way  to  compute  the  quantity  in  equation  2.  Note  that  the 
correlation  matrix  is  the  product  of  the  data  matrix  X  with  its  transpose.  So  we  can  do  the 
QR  on  X,  which,  in  our  implementation,  is  assumed  to  be  a  full-rank  real  matrix.  As  will 
be  shown  in  the  next  section,  we  need  to  store  both  Q  and  R  factors  to  compute  the 
MVDR  beamforming.  There  are  several  approaches  to  do  the  QR  factorization  on  the 
matrix.  Since  we  assume  X  is  full-rank,  we  can  use  a  QR  algorithm  without  column-pivot- 
We  use  the  Householder  transformation  method,  which  has  been  shown  to  be  numeri- 
stable  (reference  7).  Given  a  matrix  A  with  m  >  /i,  the  sequential  algorithm 

in  reference  7  as  follows: 

The  Householder  QR  Algorithm 
for  1 :  n 

v(j  :  m)  ♦-  house(i4(i  :  m,  j)); 

Mi  :  m,  j  ;  n)  4-  row.hou8e(A(j  :  m,j  :  n),v{j  :  m)); 
if  j  <  m 

A(j  +  l:m,j)*-  v{j  -J-l  :  m); 
end 
end 

In  the  algorithm  above,  the  function  house(x),  where  x  is  an  /i-vector,  computes 
n  ctor  V  such  that  (/  -  2vv‘lv'v)x  is  zero  in  all  but  the  first  component;  the  function 
row.house  (fi,  x)  overwrites  the  matrix  B  with  PB,  where  P  =  I  -  ivv'fv'v. 

In  our  ring  QR  implementation,  the  input  matrix  is  partitioned  and  distributed  among 
processors  in  the  same  way  as  we  did  for  the  Cholesky  factorization.  With  some  modifica¬ 
tions  to  the  ring  Cholesky  algorithm,  we  have  the  following  ring  QR  algorithm  for  the 
processor  i: 

A  Ring  Parallel  QR  Algorithm 

8  4-  1,  ;  1; 

l<ut  4-  t  -I-  (4  -  l)j»; 
next.rid «—  (i  +  1)  mod(p); 
nextJast  *-  nextjrid  +  (4  -  l)p; 
while  s  <n 

if  J  €  {t,  i  +  p,  •  •  ■ ,  last} 

v{j  :  m)  4-  hoxi8e(Ajpc(j  :  m,  j)); 

Store  v(j  :  m)  into  Q{j  :  m,  j); 
send{v{j  :  m), right)] 

AiocU  4)) 

4-  row.hou8e(i4joe(j  :m,j:  k)),v{j  :  m)); 

«  4-  J  +  1,  J  J -I- 1; 

—continued 
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else 

reeeive{v{j  :  m),left)] 
if  s  ^  {nextjrid,  nextjrid  +  ?,*••,  nextJast} 
send{v(j  :  m), right)', 
end 

Store  v[j  ;  m)  into  Q(j  :  m,  j); 
if  d  <  last 

AloeU  •  *) 

♦-  row.hou8e(j4joe(i  :  ^,3  •  ^))>w(j  :  ”»)); 

end 

d  «—  d  +  1; 

end 

end 

Again  we  need  to  explain  the  variables  in  the  above  algorithm.  The  variables  s,  j,  k, 
and  last  have  the  same  meanings  as  in  the  ring  Cholesky  algorithm.  The  variables  next_rid 
and  nextJast  are  the  ring  K)  and  the  largest  global  column  index  for  the  processor  on  the 
ring  that  is  to  the  right  of  ith  processor.  is  to  be  overwritten  by  the  partial  factor  R, 
and  Q  is  used  to  store  the  orthogonal  vectors  v.  The  ring  QR  algorithm  has  a  similar 
structure  to  the  ring  Cholesky  algorithm.  Since  we  need  to  store  the  matrix  R  and  the 
whole  matrix  Q  (in  the  form  of  vectors  v’s  on  each  processor),  the  variable  s  for  the  while 
loop  now  runs  from  1  up  to  the  column  dimension  for  the  input  matrix  A.  If  5  is  not  less 
than  last,  we  only  store  the  received  vector  v  into  Q  and  do  not  update  A,^^.  In  generating 
v(j  :  m),  we  normalized  A,^^  (j  :  ;)  by  its  1  L  norm  to  avoid  overflow  or  underflow. 

The  above  ring  QR  algorithm  is  rich  in  the  level-2  operation  of  matrix-vector  multipli¬ 
cation,  but  not  rich  in  the  level-3  operation  of  matrix-matrix  multiplications;  the  latter 
operation  is  more  desirable  for  reducing  the  memory  traffic  on  many  high-performance 
architectures.  Using  the  idea  of  block  Householder  QR  factorization  (reference  7),  we  can 
modify  the  above  ring  QR  algorithm  to  get  more  level-3  operations.  The  idea  stems  from 
the  fact  that  a  product  of  n  x  n  Householder  matrices  Q  =  Q^  '  "Q,.  can  be  written  in  a 
form  called  the  WY  representation 

2  =  /  +  wr  , 

where  W  and  Y  are  n  x  r  matrices.  More  specifically,  given  an  orthogonal  matrix 
Q  =  I  +  WY'  and  a  Householder  matrix  H  =  I  -2vv'fv'v,  we  have 

Q,  ^  QH  =  I  +  WJ',  , 

where  =  [W,  w]  and  Y  =  [T,  v]  with  w  =  -Qv  /v’v .  For  a  derivative  of  this,  see  reference 
8.  The  basic  idea  of  the  block  Householder  QR  is  to  form  a  product  of  Householder 
matrices  using  the  WY  representation  and  then  apply  this  product  to  the  remaining  un¬ 
updated  columns.  The  block  Householder  QR  algorithm  can  be  found  in  reference  7  or  8. 

To  incorporate  the  block  QR  into  our  ring  QR  algorithm,  we  may  modify  the  while 
loop  as  follows.  When  if  j  e  {i,  i  +  p,  •  •  last)  is  true,  we  only  update  the  j\h  column  in 
the  processor  i.  The  W  and  Y  matrices  are  updated  in  each  while  loop.  At  the  end  of  each 
while  loop,  we  check  the  global  counter  s  to  determine  if  we  need  to  do  a  block  update  on 
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the  local  columns  mod(p)  +  1,  •••,/:].  The  function  row.house  also  needs  some  obvious 
modification.  The  modified  ring  block  QR  algorithm  is  as  follows: 

A  Ring  Block  Parallel  QR  Algorithm 

<  1,  j  1; 

last  ♦-  *  +  (4  -  l)p; 

W  -  y  4-  J; 
next.rid  *-  (i  +  1)  mod(p); 
nextJast  nextjrid  +  (i  -  l)p; 
while  4  <  n 

if  4  e  {»,*  +  ?,•••,  Iasi) 

v{j  :  m)  4-  houae{Aioc(J  :  m,  j)); 

Update  W  and  Y  using  v(;  :  m); 

Store  v{j  :  m)  into  Q(j  :  m,  j); 
send(v{j  :  m), right); 

AiocU  ■ 

4-  row.hou8e(Ajoe(y  :  m,j)),v{j  :  m)); 

4  4-  4  +  1,  i  4-  j  +  1; 

else 

reeeive{v{j  :  m),  left); 

Us  ^  {next-rid, next-rid  +  p, •  •  • , nextJast) 

4end(v(j  :  m),  right); 
end 

Update  W  and  Y  using  received  v{j  :  m); 

Store  v{j  :  m)  into  Q{j  :  m,j); 
end 

if  ((4  mod{p))  =  0  and  4  <  fast; 

+  1 :  m,  4/p  +  1 :  *) 

♦-  row.house(Aioe(4  +  1 :  m,  4/p  +  1 :  i)),  I  +  WY'); 

end 

4  4-4  +  1; 
end 

We  implemented  the  ring  QR  on  the  iWarp  system.  The  modified  block  version  has 
not  been  implemented  at  this  moment  since  we  are  not  sure  that  our  current  system  can 
do  level-3  operations  more  efficiently.  We  also  implemented  QR  on  a  fully  connected 
network  of  nodes.  An  easy  modification  to  the  ring  QR  algorithms  produces  a  QR 
algorithm  for  the  fully  connected  topology.  We  will  discuss  more  about  full  connection  in 
the  next  section. 
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6  A  PARALLEL  MVDR  BEAMFORMER  IMPLEMENTATION 


We  first  show  what  needs  to  be  computed  for  a  MVDR  beamformer  using  the  QR 
factorization.  As  shown  in  section  2,  we  want  to  compute  .  Since  =  XX'  and 

X  =  QR,  we  have 

=  {EXxY'x-^E)-^ 

=  {EXR'Q')-^{QR)-^E)-^ 

=  {{R-^Q'EYiR-^Q'E))-^  =  {z'z)-\ 

where  z  =  R'^Q'E  =  R~^QE  since  Q  is  symmetric.  Thus  we  see  the  MVDR  computation 
using  QR  consists  of  a  QR  factorization,  a  matrix  multiplication  of  Q  with  the  direction- 
of-look  vector  £,  and  a  solution  of  an  upper  linear  triangular  system  z  =  R~^y. 

In  the  last  section,  we  discussed  implementation  of  QR  factorization.  Now  we  want  to 
find  a  parallel  algorithm  for  the  solution  of  a  linear  triangular  system.  Reference  9  gives  a 
detailed  discussion  of  several  parallel  triangular  system  algorithms,  which  are  based  on 
the  following  basic  sequential  algorithms: 


Two  Triangular  System  Algorithms 


for  t  s  1  :  n 

for  7  =  1 :  n 

for  j  =  1  :  1*  -  1 

*i  =  hi 

-  ZjLij 

for  <  =  j  -1- 1 :  n 

end 

hi  —  hi  —  zjLij 

*i  = 

end 

end 

end 

The  algorithms  presented  above  are  for  lower  triangular  systems.  But  an  index  rever¬ 
sal  in  these  algorithms  can  make  them  work  for  an  upper  triangular  system.  Using  the 
terminology  in  reference  9,  the  algorithm  on  the  left  is  called  a  scalar-product  algorithm, 
and  the  algorithm  on  the  right  is  called  the  vector-sum  algorithm.  If  our  sole  task  were  to 
solve  a  triangular  system  on  a  parallel  machine,  we  could  partition  the  matrix  by  either 
columns  or  rows.  To  explore  parallelism  from  the  above  algorithms,  we  have  the  choice 
of  partitioning  the  work  in  the  inner  loop  or  the  outer  loop.  So  there  are  many  possible 
parallel  algorithms.  Since  we  have  to  solve  a  triangular  system  following  the  parallel  QR 
factorization,  we  do  not  have  the  freedom  of  choosing  a  matrix  partition.  In  fact,  we  must 
use  the  column  partition  of  the  upper  triangular  matrix  i?,  which  is  already  distributed 
among  processors  after  the  QR  factorization  is  performed.  We  also  decide  to  parallelize 
the  inner  loop  for  a  relatively  simple  implementation.  It  turns  out  that  the  algorithm  on 
the  above  left  can  be  used  as  the  basis  to  construct  an  algorithm,  called  the  “fan-in 
scalar-product”  algorithm  in  reference  9,  which  parallelizes  the  inner  loop  and  requires  a 
column  partition  of  the  matrix.  The  resulting  algorithm,  with  “fan-in  scalar-product” 
implemented  on  a  unidirectional  ring,  is  as  follows: 
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The  Ring  “Fan-In”  Triangular  System  Algorithm 

for  t  =  1 :  n 

j  =  0 

for  j  =  1  :  »  -  1 

if  j  €  {rid,  rid  +  p,- ••  ,rid+  (k  -  l)j)} 

S  —  S  + 

end 
ift^  1 

n«w-td  =  rid  —  map{iy, 
if  new-id  <  0 

new  Jd  =  ntviAd  +  p\ 
if  new  Jd  =  1 
jend(j,rt^^); 
if  new  Jd  =  0 

reeeive{buffer,  left ); 
j  =  j  +  buffer; 
end 

if  new  Jd  ft  0  and  newJd  ^  1 
receive(buffer,  left); 

*  =  s  +  buffer; 

»end{$, right); 
end 
end 

if  i  €  {rid,  rid + p,  •  •  • ,  rid  +  (*  -  l)j)} 

*i  =  {bi  -  s)/Lii 

end 

In  the  above  algorithm,  rid  is  the  ring  ID  for  node  /.  The  mapii)  is  the  ring  ID  for  the 
node  that  contains  ith  column  of  the  matrix.  The  statements  in  the  if  /  1  section  consti¬ 

tute  the  “fan-in”  operation.  In  each  loop,  what  the  “fan-in”  operation  does  is  to  send  all 
partial  updates  of  to  the  node  with  ring  ID  map(i).  With  a  unidirectional  ring  connec¬ 
tion,  we  implemented  the  “fan-in”  by  computing  a  newJd  that  is  the  “distance”  from  the 
current  node  to  the  node  mapif).  Once  this  newJd  is  computed,  it  is  easy  to  decide  what 
each  node  needs  to  do  in  this  “fan-in”  operation.  As  can  be  seen  from  thi  Igorithm,  the 
parallelism  comes  from  computing  the  partial  updates  of  6.;  the  unkno^vn  jc,s  are  still 
computed  sequentially. 

The  main  disadvantage  of  a  ring  connection  of  nodes  is  the  communication  overhead. 
For  a  unidirectional  ring  with  n  nodes,  its  diameter  \s  n  1.  For  a  full  connection  of  n 
nodes,  its  diameter  is  1.  So  for  a  “fan-in”  operation,  the  unidirectional  ring  connection  is 
less  efficient  than  for  a  full  connection  because  the  former  requires  more  intermediate 
steps  for  passing  a  message  around.  Though  it  requires  more  work  to  set  up  a  full  connec¬ 
tion  of  nodes,  at  least  on  the  iWarp  system,  it  is  easy  to  modify  our  QR  and  triangular 
system  ring  program  so  that  they  run  on  a  full  connection  of  nodes.  We  do  not  list  the 
corresponding  algorithms  for  a  full  connection  of  nodes  (they  are  simpler  and  have 
similar  structures  to  the  ring  algorithms).  We  just  give  their  performance  results  in  the 
next  section. 
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With  the  parallel  QR  and  the  parallel  triangular  system  subroutines,  we  can  now  give 
the  whole  program  for  computing  the  MVDR  beamforming  with  n  direction-of-look 
vectors  *  •  ',EJ  as  follows: 

A  Parallel  MVDR  Beamformer  Program 

Set  up  connection  network  of  nodes  on  the  system; 

Dynamic  memory  allocations  on  each  node; 

Input  array  output  matrix  X  to  node  0; 

Distribute  appropriate  portions  of  X  to  each  node  on  the  network; 

Perform  a  parallel  QR  factorization  on  X; 

for  1  =  1  :  n 

Input  direction-of-look  E.  to  node  0; 

Broadcast  E.  to  each  node  on  the  network; 

Compute  in  parallel  z  =  QE.  ; 

Compute  in  parallel  s  =  R~^2  and  s^; 

Gather  components  of  /  from  each  node  on  the  network  to  node  0; 

Output  1/s^  from  node  0 

end 

We  used  dynamic  memory  allocation  in  the  above  program.  Allocating  memory  at  run 
time  makes  the  program  more  robust  and  storage  structures  more  flexible.  We  assume 
that  node  0  is  the  only  processor  that  does  external  I/O.  We  also  see  that  no  explicit 
global  synchronization  is  needed  to  ensure  the  parallel  program  function  correctly  as  long 
as  every  interprocessor  message  passing  proceeds  correctly. 
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7  PERFORMANCE  MEASUREMENTS 


In  this  section,  we  present  performance  measurements  on  the  iWarp  system  of  the 
parallel  algorithms  we  have  discussed.  All  floating-point  operations  were  performed  using 
single  precision.  Because  each  node  on  our  current  iWarp  system  has  0.5  megabyte  of 
physical  memory,  we  choose  the  data  matrix  X  for  performance  measurements  to  be  128 
X  128.  We  first  want  to  point  out  that,  with  the  current  software  release  (e.g.,  the  com¬ 
piler)  on  the  system,  the  iWarp  node  performance  on  a  C  code  is  low  compared  with  its 
theoretical  (peak)  performance.  For  example,  we  have  run  a  matrix  multiply  C  program 
on  a  single  node  with  available  optimizations  and  obtained  a  performance  of  about  0.65 
MFLOPS.  Hence,  we  do  not  expect  good  MFLOPS  performance  from  our  parallel  pro¬ 
gram  on  the  iWarp  system.  Relatively  speaking,  the  iWarp  system  has  good  bandwidth  for 
interprocessor  communications.  It  is  interesting  to  see  the  speed-ups  we  can  get  from  our 
parallel  algorithm  implementations  on  the  system.  For  comparison,  we  also  measured 
corresponding  performances  on  a  Convex  C-240;  we  just  used  one  vector  processor  on 
that  system.  Using  the  usual  terminology  in  parallel  processing,  speed-up  is  the  ratio  of 
performance  on  a  single  processor  to  the  performance  on  multiple  processors,  and 
efficiency  is  the  ratio  of  speed-up  to  the  number  of  processors  used. 

Table  1  lists  the  performance  measurements  for  a  Cholesky  factorization  using  a  uni¬ 
directional  ring  connection.  It  shows  that  we  got  a  speed-up  of  3.3  and  an  efficiency  of 
83%  using  4  nodes;  we  got  a  speed-up  of  7.1  and  an  efficiency  of  44%  using  16  nodes. 


Table  1.  Performance  measurements  for  a  Cholesky  factorization. 


Machine 

Cholesky  Factorization  (128  x  128) 
tf  Processors  CPU  time  (seconds) 

Speed-ups 

iWarp 

1 

2.00 

iWarp 

4 

0.60 

3.3 

iWarp 

16 

0.28 

7.1 

Convex 

1 

0.09 

Table  2  lists  the  performance  measurements  for  a  QR  factorization  using  a  unidirec¬ 
tional  ring  connection.  It  shows  a  speed-up  of  3.6  and  an  efficiency  of  90%  using  4  nodes; 
a  speed-up  of  10.3  and  an  efficiency  of  64%  using  16  nodes. 


Table  2.  Performance  measurements  for  a  QR  factorization. 


Machine 

Cholesky  Factorization  (128  x  128) 

#  Processors  CPU  time  (seconds) 

Speed-ups 

iWarp 

1 

7.20 

iWarp 

4 

2.00 

3.6 

iWarp 

16 

0.70 

10.3 

Convex 

2 

0.22 

16 


Table  3  lists  the  performance  measurements  of  a  MVDR  with  QR  using  a  unidirec¬ 
tional  ring  connection.  In  performance  measurements  for  MVDR,  we  use  only  one  direc- 
tion-of-look  vector  E  in  the  program.  It  shows  we  got  a  speed-up  of  3.3  and  an  efficiency 
of  83%  using  4  nodes;  a  speed-up  of  5.8  and  an  efficiency  of  34%  using  16  nodes. 


Table  3.  Performance  measurements  for  a  MVDR  with  QR. 


Machine 

Cholesky  Factorization  (128  x  128) 

#  Processors  CPU  time  (seconds) 

Speed-ups 

iWarp 

1 

11.5 

iWarp 

4 

3.50 

3.3 

iWarp 

16 

5.8 

Convex 

1 

0.40 

Table  4  lists  the  performance  comparisons  for  Cholesky,  QR,  and  MVDR  using  a  full 
connection  of  four  nodes  with  a  unidirectional  ring  connection.  For  the  Cholesky  algo¬ 
rithm,  its  performance  on  a  full  connection  of  nodes  is  about  30%  faster  than  its  perform¬ 
ance  on  a  ring  connection.  For  the  QR  algorithm,  we  do  not  see  much  performance 
difference  between  a  full  connection  and  a  ring  connection.  The  performance  difference 
for  MVDR  is  about  20%  with  a  full  connection  over  a  ring  connection. 

Table  4.  Performance  comparisons  for  Cholesky,  QR,  and  MVDR. 


Comparisons  between  Full  and  Ring  Connection  Using  Four  iWarp  Processors 
Application _ Type  of  Connection _ CPU  time  (seconds) 


Cholesky 

Ring 

0.60 

Cholesky 

Full 

0.43 

QR 

Ring 

2.00 

QR 

Full 

1.93 

MVDR 

Ring 

3.50 

MVDR 

Full 

2.80 

I 
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8  REMARKS  AND  FUTURE  INVESTIGATIONS 

The  design  of  iWarp  node  architecture  and  the  system  physical  network  as  a  two- 
dimensional  mesh  is  intended  to  make  it  suitable  for  both  fine-grain  computations  and 
coarse-grain  computations  (reference  3).  For  the  matrix  size  we  used,  our  problem  may 
be  seen  as  a  fine-grain  application.  We  think  the  good  speed-ups  on  Cholesky  and  QR 
factorizations  using  four  nodes  show  the  iWarp  system’s  communication  bandwidth  is 
fast.  It  is  also  reasonable  to  see  that  efficiency  goes  down  as  the  number  of  nodes 
increases  on  a  unidirectional  ring,  since,  except  for  some  hignly  pipelined  types  of  appli¬ 
cation,  the  communication  overhead  increases  with  a  :  rger  ring  radius.  We  expect  the 
efficiency  of  our  application  would  be  greater  on  a  more  richly  connected  topology  like  a 
bidirectional  ring  or  a  two-dimensional  mesh.  Due  to  the  tedious  work  involved  in  ng 
a  fully  connected  network  of  nodes,  we  only  set  up  a  fully  connected  network  containing 
four  nodes  to  compare  its  performance  with  a  unidirectional  ring.  Though  we  do  not  see  a 
significant  performance  difference  on  the  QR  factorization,  which  implies  the  communi¬ 
cation  overhead  is  very  small  for  that  application  on  a  four-node  ring,  we  see  an  obvious 
performance  difference  on  the  MVDR.  This  difference  clearly  came  from  the  tri  ngular 
system  solver.  It  seems  to  us  that  the  parallel  triangular  system  algorithm  is  “finer  grain” 
than  is  the  parallel  QR  algorithm,  so  the  former  should  perform  better  on  a  more  richly 
.  mnected  topology. 

Compared  with  the  performance  of  some  vector  machines  in  a  high-level  language 
code  (like  C  or  Fortran),  we  find  our  current  iWarp  system  is  still  not  a  fast  machine. 
This  is  not  surprising  considering  the  fact  that  each  iWarp  node  is  not  a  vector  processor 
and  the  software  technology  for  the  system  is  not  very  mature  yet.  At  this  time,  however, 
we  can  use  the  iWarp  system  to  test  and  evaluate  parallel  algorithms  using  different 
connection  topologies  and  gain  parallel  programming  experience. 

We  think  a  few  things  are  worth  trying  on  the  iWarp  system  in  the  near  future.  It 
would  be  interesting  to  try  a  larger  size  problem  and  thus  use  more  processors  for  per¬ 
formance  testing,  which  may  require  us  to  upgrade  memory  on  at  least  some  of  the 
processors.  We  are  also  interested  in  developing  and  implementing  systolic  algorithms 
that  use  two-dimensional  mesh— a  natural  choice  for  the  iWarp  system.  We  would  also 
like  to  modify  and  run  our  application  codes  on  the  Intel  Touchstone  system  so  that  a 
performance  comparison  can  be  made  for  these  two  parallel  MIMD  machines. 
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APPENDIX  A 


MVDR  CODE 


A-1 


/* 

*  A  Head  File  for  MVDR  Beamformer 
V 


j?  include 
S include 
# include 
# include 
jf  include 
# include 
# include 
H include 
H include 


<sys/tiitie .  h> 
<stdio . h> 

<math . h> 

< iwarp_cc . h> 
<gates . h> 

<  regnums . h> 
<asm/gen_asm . h> 
<asm/pw_asm . h> 
<pathl ib/pl . h> 


/*  Define  the-  scale  of  the  problem  !  */ 

#define  maxcell  4  /*  total  number  of  iWarp  cells  to  use  */ 

^define  rdim  128  /*  row  dimension  of  data  matrix  A  */ 
#define  cdim  128  /*  column  dimension  of  data  matrix  A  */ 
^define  srdim  rdim  /*  row  dimension  of  storage  */ 

idefine  scdim  cdim  /*  column  dimension  of  storage  */ 


/*  Define  date  type  */ 
typodef  float*  vector; 
typedef  vector*  matrix; 

/*  Function  declarations  */ 

void  main ( )  ; 

void  fl_connt() ; 

void  de_connt()»‘ 

void  assign_chan ( ) ; 

void  send^a_buf ( ) ; 

void  receive  a_buf(); 

void  input_A7) ; 

void  get_space() ; 

void  distribute_A() ; 

void  distribute_xv ( ) ; 

void  qr ( ) ; 

void  store  qq(); 

void  gen_gT) ; 

int  imax ( )  ; 

void  updateO; 

void  mvdr(); 

void  tri_soiver ( ) ; 

void  gather_g{); 

void  print_g(); 

void  print_qq(); 

/*  Other  global  definitions  */ 

char  *malloc ( ) ; 

int  _cellid,  ringid; 

int  elsize;  /*  size  of  a  double  precision  variable  */ 
int  c[maxcell]; 
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/* 

*  A  Four-ProcQssor  Proqram  Ucing  Pathl.ib  Communication  to  Implement 

*  a  MVDR  beamformor  using  a  Ring  Connection. 

* 

*  Input:  a  data  matrix  A  and  a  column  vector  (direction  of  look)  x. 

*  Output:  x'inv(AA')x  (  '  means  transpose),  x  has  dimension  of  row(A). 

■k 

*  Computation  procedure: 

*  1)  A  =  QR,  Q  =  orthogonal,  R  =  upper-triangular; 

*  2)  form  the  quantity  z  =  inv(R)Q'x; 

*  3 )  compute  b  =  z ' z ; 

*  4)  output  b. 

V 

/*  This  file  contains  main(),  ring(),  de_ring(),  send_a_buf(), 
receive_a_buf { ) .  */ 


((include  "mvdr.h"  /*  mvdr.h  contains  'include  files'  and 

global  declarations  */ 


long  sec,  usee; 

int  i_chan,  o_chan,  headers,  header_count ; 

float  exe_time; 

struct  timeval  tpl,  tp2 ; 

struct  timezone  tzpl,  tzp2; 

void  main() 

/*  variables  definitions  */ 

matrix  a;  /*  data  matrix  A  or  data  sub  matrix  of  A  */ 

matrix  qq;  /*  triangular  Q  factor  matrix  (for  storing  ql,...,qn)  */ 

vector  XV ;  /*  direction  of  look  column  vector  */ 

int  ncols; 

float  out;  /*  output  value  b  */ 

/*  Initialize  Pathlib  */ 

PL_INIT(_cellid,  Oxfffc,  Oxff) ; 

pl~rpe_configure(0x0030,  OxOOcO,  0x0300,  OxOcOO,  OxfOOO) ; 

elsize  =  sizeof (float) ; 
ncols  =  cdim/maxcell ; 

/*  Set  up  the  Ring  */ 
ringO  ; 


/*  Input  matrix  A  and  column  vector  xv  to  cell  0  and  allocate 
space  for  other  cells  */ 
if  (ringid  ==  o) 

input_A(&a,  &qq,  &xv,  srdira,  sedim,  ncols) ;  /*  input  A  to  cell  0  */ 
else 

get_space (&a ,  &qq,  Srxv,  srdim,  ncols)  ;  /*  allocate  space  for  cells 

other  than  cell  0  */ 

/*  Distribute  columns  of  A  to  appropriate  cells  */ 
distribute_A(a,  ncols) ; 

/*  Parallel  computing  QR  decomposition  begins  ...  */ 
qr(a,  qq,  ncols) ; 

/*  end  of  computing  QR:  R  distributed  in  a's,  Q  stored  in  qq  */ 

/*  Distribute  vector  xv  to  all  cells  */ 
distribute_xv(xv,  ncols)  : 
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/*  Parallel  MVDR  begins  ...  */ 

Tnvdr(a,  qq,  xv,  ncols,  &out)  ; 

/*  Output  MVDR  value  stored  in  'out'  */ 
if(ringid  ==0)  { 

out  =  1.0/out; 

printf("***  output  =  %.4f  ***\n",  ringid,  out) ; 

) 

/*  Destroy  the  ring  connection  */ 
de_ring ( ) ; 

exit  (0)  ; 

)  /*  main  */  • 


/*  Set  up  a  ring  with  four  cells  */ 

void 

ringO 

I 

int  next,  o_dir; 
int  header; 

int  rowl_last=l,  row2_f irst=8 ,  row=8 ; 


if  (_cellid  <  rowl_last)  { 
next  =  _cellid  +  1; 
o  dir  =  PL_DIR_XR; 
ringid  =  _cellTd; 

) 

if  (_cellid  ==  rowl_last)  { 
next  =  9 ; 

o^dir  =  PL_DIR^YD; 
ringid  =  _cellld; 

if  (_cellid  ==  row2_first)  { 
next  =•  0 ; 

o^dir  =  PL_DIR_YU; 
ringid  =  3; 

if  (_cellid  >  rowl_last  fit  _cellid  1=  row2_flrst)  ( 
next  =  _cellid  -  1; 
o  dir  =  pl_DIR_XL; 
ringid  =  2 ; 

) 


/*  Set  up  logical  channels  to  form  a  Ring  */ 

/*  Each  cell  does  an  accept  and  a  create  connection  */ 

ochan  =  pl_create_connection (GATEOOOT,  o_dir,  fiinext,  1) ; 
(void)  pl_produce  message  header (GATEO_OUT,  0) ; 
i_chan  =  pl_accepI_connecEion  (GATEI  IN,  Siheader)  ; 

(void)  pl_consume_message_header (GATE1_IN,  Siheader)  ; 

)  /*  Ring  */ 


void 

de_ring ( ) 

{ 

/*  Terminate  the  output  channel  */ 

(void)  pl_produce_message_trailer (GATEO_OUT) ; 
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(void)  pl_destroy_connection (o_chan,  GATEO_OUT) ; 
/*  Terminate  the  input  channel  */ 

(void)  pl_consume_message  trailer (GATE1_IN) ; 
(void)  pl_conclude_connection  (i_chan,  GATE1_IN)  ; 


void 

receive  a_buf(pbuf,  N) 
char  *pBuf; 
int  N; 

int  qt,  rd,  k,  ibuf; 
int  *ip  =  (int  *)  pbuf; 

qt  =  N/4 ,  rd  =  N  -  qt*4 ; 
for  (k=0;  k<qt;  k++)  { 

ibuf  =  _receivei (GATE1_IN) ; 
*ip++  =  ibuf; 

) 

if  (rd  >  0)  { 
char  *pt; 

char  *cp  =  (char  *)ip; 
ibuf-  raceivQi(GATEl_IN) ; 
pt  =  (cHar  *)ibuf; 
for  (k=0;  k<rd;  k++)  { 

*cp++  =  *pt++; 

) 

) 

)  /*  receive_a_buf  */ 


void 

^  a_buf (pbuf,  N) 

'*pbuf ; 

N; 

int  qt,  rd,  k,  ibuf; 
int  *ip  =  (int  *)  pbuf; 

qt  =  N/4,  rd  =  N  -  qt*4; 
for  (k=0;  k<qt;  k++)  { 
ibuf  =  *ip++; 
_sendi(GATEO_OUT,  ibuf) ; 


(rd  >  0)  { 

ihar  *pt  =  (char  *)ibuf; 
char  *cp  =  (char  *)ip; 
for  (k=0;  k<rd;  k++)  { 
*pt++  =  *cp++; 

) 

_sendi (GATEO_OUT,  ibuf) ; 

) 

)  /*  send_a_buf  */ 
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*  A  few  subroutines  called  by  mvdr.c: 

*  strcpyO,  input_A(),  get_space  ( )  ,  distribute_A  ()  ,  qr(), 

*  gather_gp»  print_r(). 

*  a[j][i]  is  the  element  on  jth  column  and  ith  row. 

*/ 

Sinclude  "mvdr.h" 

void  strcpy(sl,  s2,  size) 
char  *sl,  *s2; 

int  size;  /*  number  of  chars  in  the  string  */ 

( 

int  i; 

for(i=0;  i<size;  i++) 
sl[i]  =  s2[i] ; 

)/*  strcpy  */ 

void 

input_A(pa,  qq,  xv,  rn,  cn,  cols) 

/*  rn,  cn  =  row  and  column  dimension  of  storage  */ 

matrix  *pa,  *qq; 

vector  *xv; 

int  rn,  cn,  cols; 

/*  I/O  files  named  'stdinO'  and  'stdoutO'  must  exist  in 
the  current  directory.  Using  Igo  with  -s  option 
to  redirect  stdin  and  stdout  to  stdlnO  and  stdoutO  */ 

matrix  mp,  mpl;  /*  mp,  mpl  are  to  traverse  the  matrix  columns  */ 
vector  vp; 
int  i,  j; 

/*  create  an  array  of  column  pointers  storage  to  store  A  (and  output)  */ 
*pa  =  (matrix)  malloc(sizeof (vector) *cn) ,  mp  =  *pa; 

*qq  =  (matrix)  malloc(sizeof (vector) *rn) ,  mpl  =  *qq; 

*xv  =  (vector)  malloc(elsize*rdim) ;  vp  =  *xv; 

/*  input  matrix  A  */ 
for(j=0;  j<cn;  j++) 

{ 

/*  input  a  column  vector  */ 
n'P[j]  =  (vector)  malloc (elsize*rn)  ; 
for (1=0;  i<rn;  i++) 

{ 

/*  fscnnL’ (stdin,  S>mp(J][l]);  */ 

if (i==j) 
mpC j ] [i]  =  150; 
else 

n>P(j][i]  =  1; 

) 

) 

/*  input  direction  of  look  vector  xv  */ 
for(i=0;  i<rn;  i++) 
vp[i]  =  (float)  i; 

/*  allocate  space  for  Q  factor  */ 
for(j=0;  j<rn;  j++) 

i^Pltj]  =  (vector)  malloc(elsize*  (rn  -  j  +  1)); 

/*  last  element  of  vector  qq[j]  stores  the  norm  squared  of  qq_j  */ 

)  /*  input_A  */ 
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void 

get_space (pa ,  qq,  xv ,  rn,  cn) 
matrix  *pa,  *qq; 
vector  *xv; 
int  rn,  cn; 

{ 

matrix  mp,  mpl; 
int  j  ; 

/*  create  an  array  of  column  pointers  storage  */ 

^pa  =  (matrix)  malloc (sizeof (vector) *cn) ,  mp  =  *pa; 

*qq  =  (matrix)  malloc (sizeof (vector) *rn) ,  mpl  =  *qq; 

*xv  =  (vector)  malloc (elsize*rn) ; 
for  (:)=0;  j<cn;  j++) 

n>PCl]  =  (vector)  malloc (elsize*rn)  ; 

/*  allocate  space  for  Q  factor  */ 
for  (j=0;  j<rn;  j++) 

J'lpiCj]  =  (vector)  malloc (elsize*  (rn  -  j  +  1)); 

/*  last  element  of  vector  qq[j]  stores  the  norm  squared  of  qq_j  */ 
)  /*  get_space  */ 


void 

distribute_A(a,  cols) 
matrix  a; 
int  cols; 

{ 

matrix  mp  =  a, 

tmp  =  a;  /*  tmp  is  to  arrange  columns  for  cell  o  */ 
int  i,  j; 

int  num_send  =  maxcell  -  ringid  -  1, 
num_recv  =  nura_send  +  1 ; 
float  buf[rdim]; 

if (ringid==0) 
num_recv  =  0 ; 

/*  the  matrix  A  has  cdim  =  maxcell*cols  columns  */ 

Cor(i-0;  i<cols;  1(+) 

{ 

/*  send  out  columns  from  cell  0  */ 
if (ringid==0) 

{ 

/*  keep  1st  ono  of  ovory  maxcoll  columns  */ 

strcpy({char  *)  tmpfi],  (char  *)  mp[ i*maxcoli ] ,  rdim*elsize) 
for(j=0;  j<num_send;  ]++) 

send_a_buf ( (char  *) mp[ j+l+i*maxcell ] ,  rdim*elsize) ; 

) 

/*  receive  and  send  columns  on  cells  1=  0  */ 
if (num_recv) 

{ 

for(j=0;  j<num_recv;  j++) 

( 

if(j==0) 

receive_a_buf ( (char  *)mp(i],  rdim*elsize) ; 
else 

{ 

receive  a  buf((char  *)buf,  rdim*elsize) ; 
send_a_Eu7( (char  *)buf,  rdim*elsize) ; 

) 

) 

) 

)  /*  index  i  */ 
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)  /*  distribute_A  */ 


void 

distribute_xv(xv,  cols) 
vector  XV ; 
int  cols; 

{ 

int  num  send  =  1,  nuin_recv  =  1  ; 
if (ringTd==0) 
num_recv  =  0 ; 

if(  ( (ringid+1) %maxcell)  ==  0) 
nuni_send  =  0  ; 
if (num  recv) 

receTve_a  "buf  (XV,  elsize*rdim)  ; 
if (num_sendy 

send_a_buf (XV,  elsize*rdim) ; 

)  /*  distributexv  */ 


void 

qr(a,  qq,  cols) 
matrix  a,  qq; 
int  cols; 


*  variable  definitions: 

*  last - global  index  for  the  last  vector  stored  in  proc(u)  . 

*  k  -  global  index  of  the  column  vector  currently  working  on. 

*  j  -  global  index  of  last  locally  generated  g-column. 

*  jl  -  local  index  of  next  column  of  loacal_A  that  will 

*  produce  a  g-column. 

*/ 


int  last  *=■  ringid  +  (cols  -  l)*maxcell; 
int  next_rid  ■=  (ringid  +  1)  %  maxcell; 

/*  next^rid  =  ringid  for  the  next  cell  on  the  ring  */ 
int  next_last  =  next_rTd  +  (cols  -  l)*maxcell; 
int  k,  jl,  kp; 
int  i,  nol; 
matrix  mp  =  a; 
float  *pt,  norm; 

float  vpl[rdimi;  /*  tmp.  storage  vector  */ 


/*  initializing  index  and  pointer  variables  */ 
k=0,  jl=0,  kp=0; 
while  (k  <  rdim  -  1) 


kp  =  (k  -  ringid)  %  maxcell; 
if(kp==0)  /*  its  turn  comes  */ 

{ 

/*  generate  a  vcolumn  based  on  kth  column  and 
copy  it  to  vpl  */ 
gen_q(mp[ jl] ,  vpl,  k) ; 
store_qq(qq,  vpl,  k) ; 

send  a_buf((char  *)  vpl,  (rdim-k) *elsize) ; 
updaEe(a,  vpl,  jl,  cols,  rdim-k); 
k++,  jl++; 

I 

else 


receive_a_buf ( (char  *)vpl,  (rdim-k) *elsize) ; 
kp  =  (k  -  nextrid)  %  maxcell; 
if(kp  !=  0) 
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send_a_buf ( (char  *)  vpl,  ( idim-k) *elsize) 
store_qq(qq,  vpl,  k) ; 
if(k  <=  last) 

updatG(a,  vpl,  jl,  cols,  rdim-k) ; 
k++; 

) 

)  /*  while  */ 

)  /*  qr  */ 


void 

store_qq(qq,  vpl,  k) 
matrix  qq; 
vector  vpl; 
int  k; 

( 

int  jl,  nel; 
float  norm  =  0.0; 

nel  =  rdim  -  k; 
for(jl=0;  jl<(rdim-k);  jl++) 
norm  +=  vpl[ j l ] *vpl [ j 1] ; 
qq[k][nel]  =  norm; 

strcpy((char  *)  qq(k],  (char  *)  vpl,  elsize*nel) ; 
/*  if (ringid==0)  { 

printfC'k  =  %d,  norm  =  %.2f\n'',  k,  norm); 
printf(''vpl  =  \n”)  ; 
for{jl=0;  jl<nel;  jl++) 
printf("%.lf  ",  vpl[jl]); 
printf  (''\n'')  ; 
printf  ("qqf  0]  =  \n'')  ; 
for(jl=0;  ]l<nel+l;  jl++) 
printf (•'%. If  ”,  qq[k][jl]); 
printf ("\n”)  ; 

) 

*/ 

)  /*  store_qq  */ 


void 

gen  g(vp,  vpl,  j) 
vector  vp; 
float  vpl[ ] ; 

int  j;  /*  jth  column  we  are  working  on  */ 

{ 

int  i=0,  n=0,  ind=0; 

float  max,  max2,  sign  =  1.0,  norm  =  0.0; 

ind  =  imax(vp,  j,  rdim); 
if ((max  =  vp[ind])  ==  0) 
printf ("max  =  0!!!\n”); 
max2  =  max*max; 
for(i=j;  i<rdim;  i++) 

norm  +=  vp[ i] *vp[ i]/max2 ; 
norm  =•  sqrt(norm); 
if(vp(j]  <  0) 
sign  =  -1.0; 
for(i=j;  i<rdim;  i++) 

‘  if(i=*j) 

vpl[n]  =  vp[i]/max  +  sign*norm; 
else 


A-10 


vpl[n]  =  vp[i]/max; 
n  +  +  ; 

) 

)  /*  gen_g  */ 


/*  find  the  index  of  max  component  of  v  between  nl  <=  index  <  n2  */ 
int  imax(v,  nl,  n2) 
float  *v; 
int  nl,  n2 ; 

{ 

int  i,  index  =  nl; 

float  val  =  fabs ( (double)  v(nl]); 

for(i=nl;  i<Vi2;  i++) 

if (val  <  fabs ( (double)  v[i])) 

{ 

val  =  fabs ( (double)  v(i]); 
index  =  i ; 

) 

return  index; 

)/*  imax  */ 


void 

update(a,  vpl,  m,  cols,  nel) 

matrix  a; 

float  vpl [ ] ; 

int  m,  cols,  nel; 

/*  m  a  local  column  index  of  first  remaining  vectors  to  be  updated  */ 
/*  nel  =  number  of  elements  in  received  vector  vpl  */ 

{ 

matrix  mp  =  a; 
int  j=0, 

Tcol  =  rdim  -  nel;  /*  column  index  of  received  column  */ 
int  idx,  /*  global  row  index  of  a  matrix  element  */ 

jdx,  /*  global  column  index  of  a  matrix  element  */ 
jp,  ip;  /*  indices  for  vector  vpl  */ 
float  norm  =  0.0,  vx  =  O.O; 

for(j=0;  j<nel;  j++) 
norm  +=  vpl( j ] *vpl[ j ] ; 

/*  update  mth  to  cols-l'th  columns:  mp[m]  ->  mp(col-l]  */ 
for(j=m;  j<cols;  j++) 

I 

vx  =  0.0,  ip  =  0; 

for (idx=jcol ;  idx<rdim;  idx++) 

{ 

vx  +=  mp( j ] [ idx] *vpl ( ip] ; 
ip++; 

) 

vx  =  2*vx/norm;  /*  coeff.  */ 
ip  =  0; 

for(idx=jcol ;  idx<rdim;  idx++) 

{ 

mp[j](idx]  -=  vx*vpl(ip]; 
ip++ ; 

) 

) 

)  /*  update  */ 


void 

mvdr(a,  qq,  xv,  cols,  out) 
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matrix  a,  qq; 
vector  XV ; 
int  cols; 
float  *out; 

( 

int  1 ,  j  ; 

float  cf  =  0.0,  fbuf  =  0.0; 

float  *sol  =  (float  *)  mal loc ( sizeof { float ) *cols) ; 
int  nrecv  =  0; 

/*  compute  Q'xv  */ 
for(j=0;  j<rdim  -  1;  j++) 

{ 

cf  =  0.0; 

for(i=0;  i<{rdim  -  j);  i++) 
cf  +=  xv[]+i] *qq[: ] ( i] ; 

cf  =  2*cf/qq( j ] [ rdim  -  ]];  /*  qq[j][rdim  -  D]  =  norm  squared 

of  qq[j]  */ 

for(i=0;  i<(rdim  -  j);  i++) 
xv[j+i]  -=  cf*qq(:](i]; 

) 

/*  now  XV  contains  Q'xv  */ 


/*  if(rlngid  ==  3) 

for(i=0;  i<rdim;  i++) 

printf ("ringid  %d:  i  =  %d,  xv[i]  =  %.2f\n",  ringid,  i,  xv(i]); 

V 

/*  compute  inv(R)z,  z  =  Q'xv  */ 

/*  each  cell  contains  corresponding  components  of  the  right-hand  side 
vector  computed  from  Q'xv  */ 

tri_solver (a,  cols,  xv,  sol) ; 

*out  =  0.0; 

for  (i=0;  i<cols;  i++) 

*out  +=  sol [ i ] *sol ( i] ; 

printf ("ringid  =  %d,  out  =  %.2f\n",  ringid,  *out) ; 

/*  collect  output  to  cell  0  */ 
fbuf  -  0.0; 
if(ringid  1=  1) 
nrecv  =  1 ; 
if(rtngid  ==  1) 

sond_Q_buf ( (char  *)  out,  clsizo) ; 
if(nrecv  ==  1  &&  ringid  1=  0) 

{ 

receive_a_buf ( (char  *)&fbuf,  elsize) ; 
fbuf  +=  *out; 

send_a_buf ( (char  *)&fbuf,  elsize); 

) 

if (ringid==0) 

recelve_a_buf  (  (char  *)£ifbuf,  elsize); 

♦out  +=  fbuf; 

) 

if(ringid  ==  0) 

printf ( "ringid  =  %d,  total  out  =  %,2f\n",  ringid,  *out) ; 


)  /*  mvdr  */ 
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/*  for  upper-triangular  matrix  */ 
void 

tri_solver (a,  cols,  xv,  sol) 
matrix  a; 
int  cols; 

vector  XV,  sol;  /*  sol [cols]  is  the  sub-solution  vecter  */ 

{ 

int  i,  j,  idx,  jdx,  nid,  mapi,  nrecv; 
int  last  =  ringid  +  (cols  -  lj*maxcell; 
float  tmp  =  0.0,  fbuf  =  0.0; 

for(i=rdim-l;  i>=0;  i — ) 

{ 

tmp  =  0.0,  fbuf  =  0.0; 
mapi  =  i%maxcell; 

for ( j=ringid ;  j<=last;  j+=maxcell)  /*  for  j  in  mycols  */ 
if (j>i) 

( 

idx  =  j/maxcell;  /*  local  col  index  */ 
jdx  -  1;  /*  row  index  */ 
tmp  +••  a (  idx]  ( jdx]  *sol  [  idx]  ; 

} 

/*  fan-in  begins  ...  */ 
if (i  1=  (rdim  -  1) ) 

( 

/*  compute  new  origin  */ 
nid  =  ringid  -  mapi; 
if (nid<0) 

nid  +-  maxcell; 
if(nid  ==  1) 

send_a_buf ( (char  *)&tmp,  slzeof (float) ) ; 
if(nid  1=  1  &&  nid  !=  O) 

( 

receive_a_buf ( (char  *)&fbuf,  sizeof (float) ) ; 
tmp  +=  fbuf; 

send_a_buf ( (char  *)&tmp,  sizeof (float) ) ; 

if  (nid  «  0) 

{ 

receive  a_buf((char  *)&fbuf,  sizeof ( float) ) ; 
tmp  +=  7buf; 

) 

) 

/*  end  of  fan-in  */ 

if ( 1 ( (i-ringid) tmaxcell) )  /*  if  i  in  mycols  */ 
sol [i/maxcell]  =  (xv[i]  -  tmp)/a [ i/maxcell ] [i] ; 

)  /*  for  i=l  to  n  */ 

)  /*  tri_solver  */ 


void 

gather_g(a,  cols) 
matrix  a; 
int  cols; 

matrix  mp  =  a; 
int  i,  j; 

int  num_send  =  ringid, 

num_recv  ■=  num_send  -  1  ; 
float  buf(rdim]; 

if (ringid==0) 

num  reev  =  maxcell  -  1; 
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if {num_send) 

{ 

/*  send  out  cols  of  local  columns  */ 
for(i=0;  i<cols;  i++) 

send_a_buf ( (char  *)  mp[cols  -  1  -i],  rdim*elsize) ; 

) 

if(num_recv  &&  (ringid  !=  0)) 

{ 

for(i=0;  i<nura_recv;  i++) 
for(j=0;  j<cols;  j++) 

{ 

receive  a  buf((char  *)  buf,  rdim*elsize) ; 
send_a_Eiu7(  (char  *)  buf,  rdim*elsize)  ; 

) 

) 

if (ringid==0) 

I 

/*  redistribute  local  columns  in  cell  0  */ 
for(i=l;  i<cols;  i++) 

strcpy((char  *)  mp[i*maxcellj ,  (char  *)  mp[i],  rdim*elsize) ; 
for(i=l;  i<=num  recv;  i++) 
for(j=0;  j<coTs;  j++) 

receive_a_buf ( (char  *)  mp[rdim  -  i  -j*maxcell],  rdim*elsize) ; 
)  /*  if  V 
)  /*  gather_g  */ 


void  free_mem(a,  cn) 
matrix  a; 
int  cn; 

{ 

int  1 ; 

for(i=0;  i<cn;  i++) 
free(a[i] ) ; 
free (a)  ; 

)  /*  free_mera  */ 
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